BBGÀgora - Advent of code

Install conda
Add conda channels
Install jupyter notebook
Create a conda environment
Register to github
Clone a github repository
Register to Advent of Code
Add a new folder and create your notebook
Python list comprehension
Python iterables and iterators
Python generators
Usage of the csv and gzip Python modules to manipulate big TSV files
Fast (and beautiful) command line creation

Anaconda distribution

Conda is a package management system. It's language agnostic: you can use it to install Python, R and many other tools.

Tabix

   conda create -n tabix -c bioconda htslib

Bedtools

   conda create -n bedtools -c bioconda bedtools

R

   conda create -n r -c r r

What is a conda "channel"?

It's a repository of conda packages.

Check your channels:

conda config --show

Check this URL: https://bioconda.github.io/

conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda config --add channels bbglab

What is a conda "environment"?

Activating an environment is mostly just a change to your PATH: the environment's own bin folder is put first.

echo $PATH
source activate tabix
echo $PATH

ll ~/anaconda3/envs/tabix
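
The same check can be done from inside Python. This is a minimal sketch using only the standard library; the helper name `path_entries` is an assumption made for illustration and is not part of conda.

```python
import os
import sys


def path_entries(path=None):
    """Split a PATH-style string into its directories (hypothetical helper)."""
    if path is None:
        path = os.environ.get("PATH", "")
    return [p for p in path.split(os.pathsep) if p]


# The shell searches these directories in order, so an activated
# environment is expected to show its own bin folder near the top.
for entry in path_entries()[:3]:
    print(entry)

# The interpreter running this code; inside an activated environment it is
# expected to live under that environment's folder.
print(sys.executable)
```

Run it before and after activating an environment and compare the first entries.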

Github

Register to github
Add your RSA public key to your GitHub settings. Print it with: cat ~/.ssh/id_rsa.pub (if you don't have an RSA key, run ssh-keygen first)
Clone the Advent of Code repository somewhere: git clone git@github.com:bbglab/adventofcode.git

What is an RSA public-private key?

What is a Git repository?

ll .git
git log
ll .git/objects/7d
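
Folders such as `.git/objects/7d` hold "loose" objects, each named after a hash of its content: the first two hex characters give the folder and the remaining ones give the file name. The sketch below, assuming a repository that uses the default SHA-1 object format, computes the id Git assigns to a file's content (a "blob"). The helper names `git_blob_id` and `object_path` are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id of a blob: hash of 'blob <size>\\0' plus the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(object_id: str) -> str:
    """Relative path where a loose object with this id would be stored."""
    return ".git/objects/{}/{}".format(object_id[:2], object_id[2:])


blob_id = git_blob_id(b"")  # id of an empty file
print(blob_id)
print(object_path(blob_id))
```

The result can be compared with `git hash-object <file>` for any file in the repository.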

Jupyter notebook

Check that you have jupyter notebook installed in the "root" environment

conda install jupyter notebook

Create an environment for the Advent of Code project

conda create -n adventofcode python=3.5 ipykernel

Create your own folder inside the cloned repository, for example:

mkdir 2016/jordi

Run jupyter notebook

jupyter notebook

What is a Jupyter Kernel?

It's an independent process with its own environment variables.

Start some notebooks and check the running processes

ps -AF | grep kernel
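
To match a notebook with one of the processes listed by `ps`, a cell can report its own process id. A minimal sketch using only the standard library; the helper name `kernel_info` is an assumption made for illustration.

```python
import os
import sys


def kernel_info():
    """Collect a few facts about the process running this code (hypothetical helper)."""
    return {
        "pid": os.getpid(),            # compare with the PID column of ps
        "executable": sys.executable,  # interpreter used by this kernel
        "cwd": os.getcwd(),            # working directory of the process
    }


print(kernel_info())
```

Running the same cell in two different notebooks is expected to print two different pids, since each notebook talks to its own kernel.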

Python list comprehension

List comprehensions are a tool for transforming one list (any iterable actually) into another list. During this transformation, elements can be conditionally included in the new list and each element can be transformed as needed.



In [6]:

[n * 2 for n in range(10) if n % 2 == 1]

Out[6]:

[2, 6, 10, 14, 18]
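
For comparison, the list comprehension above can be written as an explicit loop. This sketch only spells out the filter and the transformation as separate steps; the variable names are illustrative.

```python
# Explicit loop: filter with "if", transform with "n * 2", collect with append
doubled_odds = []
for n in range(10):
    if n % 2 == 1:
        doubled_odds.append(n * 2)

# The comprehension packs the same three steps into one expression
same_result = [n * 2 for n in range(10) if n % 2 == 1]

print(doubled_odds)                  # [2, 6, 10, 14, 18]
print(doubled_odds == same_result)   # True
```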



In [7]:

# Also a dict
{n: n * 2 for n in range(10) if n % 2 == 1}

Out[7]:

{1: 2, 3: 6, 5: 10, 7: 14, 9: 18}

In [10]:

# Or a set
{n * 2 for n in range(10) if n % 2 == 1}

Out[10]:

{2, 6, 10, 14, 18}

Python iterables and iterators



In [11]:

# ITERABLE: anything you can loop over with a for statement
for a in [1,2,3]:
    print(a)

In [14]:

# ITERATOR: an object that walks through an iterable, returning one item per next() call
list_iterator = iter([1,2,3])
print(next(list_iterator))
print(next(list_iterator))
print(next(list_iterator))

In [15]:

# Asking for a fourth item from a three-item list raises StopIteration
list_iterator = iter([1,2,3])
print(next(list_iterator))
print(next(list_iterator))
print(next(list_iterator))
print(next(list_iterator))

1
2
3

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-15-11a54e9f9f0b> in <module>()
      3 print(next(list_iterator))
      4 print(next(list_iterator))
----> 5 print(next(list_iterator))

StopIteration:
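
A for loop relies on exactly this protocol: it calls iter() once, then next() repeatedly, and stops quietly when StopIteration is raised. The sketch below imitates that behaviour by hand; the function name `manual_for` is hypothetical.

```python
def manual_for(iterable):
    """Collect the items of an iterable the way a for loop walks through them."""
    iterator = iter(iterable)      # ask the iterable for a fresh iterator
    items = []
    while True:
        try:
            item = next(iterator)  # fetch the next item
        except StopIteration:      # no items left: leave the loop
            break
        items.append(item)
    return items


print(manual_for([1, 2, 3]))  # [1, 2, 3]
```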

Python generators



In [19]:

[n * 2 for n in range(10) if n % 2 == 1]

Out[19]:

[2, 6, 10, 14, 18]

In [20]:

# Convert a list comprehension to a generator expression: use () instead of []
generator = (n * 2 for n in range(10) if n % 2 == 1)
generator

Out[20]:

<generator object <genexpr> at 0x7f18904e5f10>

In [21]:

iterator = iter(generator)
next(iterator)

Out[21]:

2

In [22]:

# The same generator written as a function with yield
def odd_double(size=10):
    for n in range(size):
        if n % 2 == 1:
            yield n*2

In [24]:

generator = odd_double()
iterator = iter(generator)
next(iterator)

Out[24]:

2

In [26]:

list(odd_double(15))

Out[26]:

[2, 6, 10, 14, 18, 22, 26]
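
Two properties of generators are worth seeing in action: values are produced only when requested, and a generator can be consumed only once. The sketch below redefines `odd_double` so that it runs on its own, and uses `itertools.islice` from the standard library to take just the first items.

```python
from itertools import islice


def odd_double(size=10):
    """Yield the double of every odd number below size."""
    for n in range(size):
        if n % 2 == 1:
            yield n * 2


# Lazy: only the first three values are ever computed, even though the
# requested range is huge.
first_three = list(islice(odd_double(10**9), 3))
print(first_three)   # [2, 6, 10]

# Single use: once exhausted, a generator yields nothing more.
generator = odd_double()
first_pass = list(generator)
second_pass = list(generator)
print(first_pass)    # [2, 6, 10, 14, 18]
print(second_pass)   # []
```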

Manage big TSV files



In [1]:

import bgdata, csv, gzip, os, pandas
from pprint import pprint

domains = os.path.expanduser('~/tmp/domains.tsv.gz')
# If you want to test use this:
# domains = os.path.join(bgdata.get_path('tcgi', 'oncodrivemut', '1.1'), 'ensembl75_pfam_domain_coordinates.tsv.gz')

In [2]:

%%time
# pandas loads the whole file into memory before filtering
df = pandas.read_csv(domains, sep='\t')
result = df[df['Ensembl Gene ID'] == 'ENSG00000261258'].head(1).to_dict(orient='records')
pprint(result)
print('\n')

[{'Ensembl Gene ID': 'ENSG00000261258',
  'Ensembl Transcript ID': 'ENST00000566592',
  'HGNC symbol': 'FKBP10',
  'Pfam ID': 'PF13202',
  'Pfam domain': 'EF-hand_5',
  'Pfam end': 456.0,
  'Pfam start': 437.0}]


CPU times: user 8.34 s, sys: 168 ms, total: 8.5 s
Wall time: 8.51 s



In [5]:

%%time
# Stream the file row by row and stop at the first match
with gzip.open(domains, 'rt') as fd:
    for r in csv.DictReader(fd, delimiter='\t'):
        if r['Ensembl Gene ID'] == 'ENSG00000261258':
            pprint(r)
            print('\n')
            break

{'Ensembl Gene ID': 'ENSG00000261258',
 'Ensembl Transcript ID': 'ENST00000566592',
 'HGNC symbol': 'FKBP10',
 'Pfam ID': 'PF13202',
 'Pfam domain': 'EF-hand_5',
 'Pfam end': '456.0',
 'Pfam start': '437.0'}


CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.44 ms

In [8]:

%%time
# Plain csv.reader: build the dict only for the matching row
with gzip.open(domains, 'rt') as fd:
    reader = csv.reader(fd, delimiter='\t')
    header = next(reader)
    for r in reader:
        if r[0] == 'ENSG00000261258':
            pprint({h: v for h,v in zip(header, r)})
            print('\n')
            break

{'Ensembl Gene ID': 'ENSG00000261258',
 'Ensembl Transcript ID': 'ENST00000566592',
 'HGNC symbol': 'FKBP10',
 'Pfam ID': 'PF13202',
 'Pfam domain': 'EF-hand_5',
 'Pfam end': '456.0',
 'Pfam start': '437.0'}


CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 2.58 ms



In [9]:

%%time
with gzip.open(domains, 'rt') as fd:
    # Strip the trailing newline so the last column name stays clean
    header = next(fd).rstrip('\n').split('\t')

    # Pre-filter the raw lines with a generator so csv only parses matching rows
    reader = csv.reader((l for l in fd if l.startswith('ENSG00000261258')), delimiter='\t')
    for r in reader:
        if r[0] == 'ENSG00000261258':
            pprint({h: v for h,v in zip(header, r)})
            print('\n')
            break

{'Ensembl Gene ID': 'ENSG00000261258',
 'Ensembl Transcript ID': 'ENST00000566592',
 'HGNC symbol': 'FKBP10',
 'Pfam ID': 'PF13202',
 'Pfam domain': 'EF-hand_5',
 'Pfam end': '456.0',
 'Pfam start': '437.0'}


CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.64 ms
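
The streaming approach can be tried without the real data file. The sketch below writes a tiny gzipped TSV to a temporary folder and searches it row by row; the column names, the values and the helper name `find_first` are all made up for illustration.

```python
import csv
import gzip
import os
import tempfile


def find_first(path, column, value):
    """Stream a gzipped TSV and return the first row whose column equals value."""
    with gzip.open(path, "rt", newline="") as fd:
        for row in csv.DictReader(fd, delimiter="\t"):
            if row[column] == value:
                return row  # stop reading as soon as a match is found
    return None


# Build a small sample file (made-up content) so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample, "wt", newline="") as fd:
    writer = csv.writer(fd, delimiter="\t")
    writer.writerow(["gene", "domain"])
    writer.writerow(["G1", "D1"])
    writer.writerow(["G2", "D2"])

print(find_first(sample, "gene", "G2"))
```

Because the function returns at the first match, the rest of the file is never decompressed or parsed, which is likely why the streaming cells above finish in milliseconds while loading everything with pandas takes seconds.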

Command line creation

http://click.pocoo.org/5/

import click

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name',
              help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for x in range(count):
        click.echo('Hello %s!' % name)

if __name__ == '__main__':
    hello()

And what it looks like when run:

$ python hello.py --count=3
Your name: John
Hello John!
Hello John!
Hello John!

Advent of Code

Santa's sleigh uses a very high-precision clock to guide its movements, and the clock's oscillator is regulated by stars. Unfortunately, the stars have been stolen... by the Easter Bunny. To save Christmas, Santa needs you to retrieve all fifty stars by December 25th.

Collect stars by solving puzzles. Two puzzles will be made available on each day in the advent calendar; the second puzzle is unlocked when you complete the first. Each puzzle grants one star. Good luck!

Solve your first puzzle!


